Goto

Collaborating Authors

 kronecker factor




Scalable Thermodynamic Second-order Optimization

Donatella, Kaelan, Duffield, Samuel, Melanson, Denis, Aifer, Maxwell, Klett, Phoebe, Salegame, Rajath, Belateche, Zach, Crooks, Gavin, Martinez, Antonio J., Coles, Patrick J.

arXiv.org Artificial Intelligence

Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out-of-reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as $n$, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.


AdaFisher: Adaptive Second Order Optimization via Fisher Information

Gomes, Damien Martins, Zhang, Yanlei, Belilovsky, Eugene, Wolf, Guy, Hosseini, Mahdi S.

arXiv.org Artificial Intelligence

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from \href{https://github.com/AtlasAnalyticsLab/AdaFisher}{https://github.com/AtlasAnalyticsLab/AdaFisher}


Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC for Large Neural Nets

Lin, Wu, Dangel, Felix, Eschenhagen, Runa, Neklyudov, Kirill, Kristiadi, Agustinus, Turner, Richard E., Makhzani, Alireza

arXiv.org Machine Learning

Second-order methods for deep learning--such as KFAC--can be useful for neural net training. However, they are often memory-inefficient and numerically unstable for low-precision training since their preconditioning Kronecker factors are dense, and require high-precision matrix inversion or decomposition. Consequently, such methods are not widely used for training large neural networks such as transformerbased models. We address these two issues by (i) formulating an inverse-free update of KFAC and (ii) imposing structures in each of the Kronecker factors, resulting in a method we term structured inverse-free natural gradient descent (SINGD). On large modern neural networks, we show that, in contrast to KFAC, SINGD is memory efficient and numerically robust, and often outperforms AdamW even in half precision. Hence, our work closes a gap between first-order and second-order methods in modern low precision training for large neural nets. The continuing success of deep learning (DL) is--to a large extent--powered by scaling up computational power (Thompson et al., 2020) to increase the number of neural network (NN) parameters that can be trained. Contemporary natural language processing (Radford et al., 2019; Brown et al., 2020; Touvron et al., 2023) and computer vision (Dehghani et al., 2023) models often consist of many billions of parameters, and will likely grow further in the future. To compensate for increasingly higher computational demands of training more parameters, many training pipelines use lower precision data types (Micikevicius et al., 2018) and memory-efficient first-order optimizers like SGD (Robbins & Monro, 1951) or Adam(W) (Kingma & Ba, 2015; Loshchilov & Hutter, 2019). One major obstacle why those methods are rarely used in deep learning is their higher memory consumption and iteration cost.


KronA: Parameter Efficient Tuning with Kronecker Adapter

Edalati, Ali, Tahaei, Marzieh, Kobyzev, Ivan, Nia, Vahid Partovi, Clark, James J., Rezagholizadeh, Mehdi

arXiv.org Artificial Intelligence

Fine-tuning a Pre-trained Language Model (PLM) on a specific downstream task has been a well-known paradigm in Natural Language Processing. However, with the ever-growing size of PLMs, training the entire model on several downstream tasks becomes very expensive and resource-hungry. Recently, different Parameter Efficient Tuning (PET) techniques are proposed to improve the efficiency of fine-tuning PLMs. One popular category of PET methods is the low-rank adaptation methods which insert learnable truncated SVD modules into the original model either sequentially or in parallel. However, low-rank decomposition suffers from limited representation power. In this work, we address this problem using the Kronecker product instead of the low-rank representation. We introduce KronA, a Kronecker product-based adapter module for efficient fine-tuning of Transformer-based PLMs. We apply the proposed methods for fine-tuning T5 on the GLUE benchmark to show that incorporating the Kronecker-based modules can outperform state-of-the-art PET methods.


Rethinking Exponential Averaging of the Fisher

Puiu, Constantin Octavian

arXiv.org Machine Learning

In optimization for Machine learning (ML), it is typical that curvature-matrix (CM) estimates rely on an exponential average (EA) of local estimates (giving EA-CM algorithms). This approach has little principled justification, but is very often used in practice. In this paper, we draw a connection between EA-CM algorithms and what we call a "Wake of Quadratic regularized models". The outlined connection allows us to understand what EA-CM algorithms are doing from an optimization perspective. Generalizing from the established connection, we propose a new family of algorithms, "KL-Divergence Wake-Regularized Models" (KLD-WRM). We give three different practical instantiations of KLD-WRM, and show numerically that these outperform K-FAC on MNIST.


Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Shi, Shaohuai, Zhang, Lin, Li, Bo

arXiv.org Artificial Intelligence

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, which may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, in which the Kronecker-Factored Approximate Curvature (KFAC) emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), it incurs extensive computation as well as introduces extra communications during each iteration. In this work, we propose D-KFAC (SPD-KFAC) with smart parallelism of computing and communication tasks to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factors computation and communication with dynamic tensor fusion, and 3) we develop a load balancing placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with 100Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme can achieve 10%-35% improvement over state-of-the-art algorithms.


An iterative K-FAC algorithm for Deep Learning

Chen, Yingshi

arXiv.org Machine Learning

Kronecker-factored Approximate Curvature (K-FAC) method is a high efficiency second order optimizer for the deep learning. Its training time is less than SGD(or other first-order method) with same accuracy in many large-scale problems. The key of K-FAC is to approximates Fisher information matrix (FIM) as a block-diagonal matrix where each block is an inverse of tiny Kronecker factors. In this short note, we present CG-FAC -- an new iterative K-FAC algorithm. It uses conjugate gradient method to approximate the nature gradient. This CG-FAC method is matrix-free, that is, no need to generate the FIM matrix, also no need to generate the Kronecker factors A and G. We prove that the time and memory complexity of iterative CG-FAC is much less than that of standard K-FAC algorithm.


Pushing the limits of RNN Compression

Thakker, Urmish, Fedorov, Igor, Beu, Jesse, Gope, Dibakar, Zhou, Chu, Dasika, Ganesh, Mattina, Matthew

arXiv.org Machine Learning

Recurrent Neural Networks (RNN) can be difficult to deploy on resource constrained devices due to their size. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource constrained environments using Kronecker product (KP). KPs can compress RNN layers by 16-38x with minimal accuracy loss. We show that KP can beat the task accuracy achieved by other state-of-the-art compression techniques (pruning and low-rank matrix factorization) across 4 benchmarks spanning 3 different applications, while simultaneously improving inference run-time.